refactor: remove devicearray code to reduce complexity (#600)

cpcloud merged 6 commits into NVIDIA:main
Conversation
/ok to test
gmarkall left a comment:

The code changes look good.
I need to spend a few minutes investigating a bit, because I am surprised that the "fast" path seemed to be slower than the default one. But, things have changed a lot since I initially implemented it, so it's also plausible we've drifted into a place where it no longer helps.
I will post benchmarks in a bit, but the difference isn't huge. I wouldn't have been surprised if it were the other way around (which was my initial interpretation until I realized it wasn't).
Force-pushed c9b0501 to 6628ea7
/ok to test

/ok to test
Force-pushed 6628ea7 to f320ce6
Benchmarks are a little variable (here the best improvement is around 7%).
Greptile Overview

Greptile Summary: This PR successfully removes the C++ extension-based `DeviceArray` class. The implementation is correct and safe. Confidence Score: 5/5

Important Files Changed: [file analysis table elided]
My understanding is that the opposite is shown - the "fast path" being removed here is actually 0-4% slower than the "fallback", so removing this code is both a performance and complexity improvement.
I've been experimenting with this locally, in two configurations: [configurations elided]

The idea being that we're comparing [elided]. With this branch, I get: [benchmark results elided]

So it looks like performance is mostly the same, except in [one case, elided]. I'm looking into why this could be.
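Launch-overhead comparisons like the one above are usually done as microbenchmarks. A minimal `timeit` sketch of the idea, with a pure-Python `launch` placeholder standing in for an actual kernel launch (this is not the repo's pytest benchmark suite):

```python
import timeit

def launch(arg):
    # Placeholder for kernel[grid, block](arg): only the per-call
    # Python-side dispatch overhead is being measured, not GPU work.
    return arg

def per_call_seconds(repeats=5, calls=10_000):
    # Take the minimum over several repeats, as microbenchmarks
    # usually do, to reduce scheduling noise.
    timer = timeit.Timer(lambda: launch(0))
    return min(timer.repeat(repeat=repeats, number=calls)) / calls

overhead = per_call_seconds()
```

Differences of a few percent, as reported in this thread, sit well within the noise floor of such measurements unless many repeats are taken.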
Also, it seems that this branch is faster in all cases with the [configuration elided].
Force-pushed d598a52 to aa2a2bf
Force-pushed aa2a2bf to 2fd2e58
Force-pushed 2fd2e58 to 3bca842
Force-pushed 868376c to b89d7fd
/ok to test
Force-pushed b89d7fd to 8a6657d

/ok to test
Additional Comments (2)

- numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 42: syntax: IDs are swapped; `cuda.jit` is dispatch mode, `cuda.jit("void(float32[::1])")` is signature mode
- numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 96: syntax: IDs are swapped here too; `cuda.jit` is dispatch mode, `cuda.jit("void(...)")` is signature mode
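For context, the distinction the review draws can be sketched in plain Python with a hypothetical `jit` decorator (a stand-in, not the real `cuda.jit`): dispatch mode compiles lazily once per argument-type tuple, while signature mode compiles eagerly for the declared signature.

```python
# Hypothetical stand-in for cuda.jit, illustrating the two modes only.
compilations = []  # records (function name, argument types) per compile

def _compile(func, argtypes):
    compilations.append((func.__name__, argtypes))
    return func  # a real JIT would return specialized device code

def jit(signature=None):
    def decorate(func):
        if signature is None:
            # Dispatch mode: compile lazily, once per argument-type tuple.
            cache = {}
            def wrapper(*args):
                key = tuple(type(a) for a in args)
                if key not in cache:
                    cache[key] = _compile(func, key)
                return cache[key](*args)
            return wrapper
        # Signature mode: compile eagerly for the declared signature.
        return _compile(func, signature)
    if callable(signature):
        # Bare @jit usage: dispatch mode.
        func, signature = signature, None
        return decorate(func)
    return decorate

@jit
def add(a, b):                      # dispatch mode: not compiled yet
    return a + b

@jit("void(float32[::1])")
def scale(a):                       # signature mode: compiled now
    return a
```

Swapping the benchmark IDs, as the review suggests, matters because the two modes pay their compilation cost at different times, so mislabeling them misattributes where the launch overhead comes from.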
8 files reviewed, 2 comments

/ok to test
Force-pushed e4101fe to 5d4c45b
Force-pushed 5d4c45b to edfdf00

/ok to test
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)



Overview
This PR removes the C++ extension-based `DeviceArray` class and replaces it with a pure Python implementation, significantly reducing codebase complexity while maintaining functionality. The changes also introduce type computation caching to mitigate potential performance overhead.

Changes

Implementation Changes

numba_cuda/numba/cuda/cudadrv/devicearray.py

- `DeviceNDArrayBase` now inherits from `object` instead of `_devicearray.DeviceArray`
- `_numba_type_instance` attribute caching in the `_numba_type_` property to avoid repeated type computation

Performance Considerations
Removed Optimization: The C++ implementation provided fast-path type fingerprinting through direct table lookup for device arrays during dispatch: `[ndim-1][layout][dtype]`.

Mitigation: Instance-level caching via the `_numba_type_` property reduces repeated type computation overhead: the type is computed once per `DeviceNDArrayBase` instance.

Expected Impact: [details elided; the remaining path is the `StridedMemoryView`-based one]

Trade-offs

Benefits: [list elided]

Costs: [list elided]
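To make both mechanisms concrete, here is a pure-Python sketch of the removed `[ndim-1][layout][dtype]` fast path and the per-instance caching that replaces it. All names and table contents are illustrative, not the real C++ extension's:

```python
# Illustrative sketch only: the real fast path lived in the C++
# extension and produced Numba type objects, not strings.

LAYOUTS = ("C", "F", "A")
DTYPES = ("float32", "float64", "int32", "int64")
MAX_NDIM = 3

def make_type(ndim, layout, dtype):
    # Stand-in for constructing a Numba array type.
    return f"array({dtype}, {ndim}d, {layout})"

# Removed fast path: precompute types for common combinations so
# dispatch becomes three index operations instead of a construction.
TABLE = [
    {lay: {dt: make_type(ndim, lay, dt) for dt in DTYPES} for lay in LAYOUTS}
    for ndim in range(1, MAX_NDIM + 1)
]

def fast_fingerprint(ndim, layout, dtype):
    if 1 <= ndim <= MAX_NDIM:
        try:
            return TABLE[ndim - 1][layout][dtype]
        except KeyError:
            pass
    # Slow path: construct the type directly.
    return make_type(ndim, layout, dtype)

# Replacement mitigation: compute the type once per array instance.
class DeviceNDArrayBase:
    def __init__(self, ndim, layout, dtype):
        self.ndim, self.layout, self.dtype = ndim, layout, dtype
        self._numba_type_instance = None  # filled on first access

    @property
    def _numba_type_(self):
        if self._numba_type_instance is None:
            self._numba_type_instance = make_type(
                self.ndim, self.layout, self.dtype
            )
        return self._numba_type_instance

arr = DeviceNDArrayBase(2, "C", "float32")
cached = arr._numba_type_    # computed once, reused on later accesses
```

Since a device array's type never changes after construction, the per-instance cache recovers most of the table's benefit while keeping the code in Python, which is consistent with the near-identical benchmark numbers reported in this thread.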